Accelerating Blocked Matrix-Matrix Multiplication using a Software-Managed Memory Hierarchy with DMA
Abstract
The optimization of matrix-matrix multiplication (MMM) performance has been well studied on general-purpose desktop and server processors. Classic solutions exploit common microarchitectural features, including superscalar execution and the cache and TLB hierarchy, to achieve near-peak performance. Typical digital signal processors (DSPs) do not have these features, and instead use in-order execution, configurable memory hierarchies, and programmable I/O interfaces. We investigate the methods needed to achieve high-performance MMM on the Texas Instruments C6713 floating-point DSP. This processor has two components that can be used to accelerate MMM: a software-managed memory hierarchy, and a direct memory access (DMA) engine that can perform block copies from main memory into the memory hierarchy. Our MMM implementation overlaps computation with DMA block transfers. For matrices larger than the data caches, we observed a 46% performance increase over a blocked MMM implementation, and a 190% increase over the Texas Instruments DSP library.
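The overlap of computation with block transfers described in the abstract is commonly arranged as double buffering: while one pair of tiles is being multiplied, the next pair is copied into a second set of buffers. The following is a minimal sketch of that idea in portable C, not the paper's implementation. The names `dma_start_copy`, `dma_wait`, `tile_mac`, `mmm_double_buffered`, `mmm_reference`, and the tile size `BS` are all hypothetical, and a plain `memcpy` stands in for the DMA engine, so no real overlap occurs here; only the control structure is illustrated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BS 4 /* tile edge length; hypothetical value chosen for illustration */

/* Stand-in for starting a DMA block copy (assumption: a real DMA engine would
 * do this asynchronously). Copies one BS x BS tile of a row-major n x n
 * matrix, starting at (row, col), into a contiguous buffer. */
static void dma_start_copy(float *dst, const float *src, size_t n,
                           size_t row, size_t col)
{
    for (size_t r = 0; r < BS; r++)
        memcpy(dst + r * BS, src + (row + r) * n + col, BS * sizeof(float));
}

/* Stand-in for waiting on DMA completion; a no-op because the copy above
 * is synchronous in this sketch. */
static void dma_wait(void) {}

/* c += a * b for contiguous BS x BS tiles. */
static void tile_mac(float *c, const float *a, const float *b)
{
    for (size_t i = 0; i < BS; i++)
        for (size_t k = 0; k < BS; k++)
            for (size_t j = 0; j < BS; j++)
                c[i * BS + j] += a[i * BS + k] * b[k * BS + j];
}

/* Blocked C = A * B with two buffer sets: tiles for step k+1 are requested
 * before the tiles for step k are multiplied. Requires n to be a multiple
 * of BS. */
void mmm_double_buffered(float *C, const float *A, const float *B, size_t n)
{
    static float a_buf[2][BS * BS], b_buf[2][BS * BS], c_tile[BS * BS];
    assert(n % BS == 0);
    size_t nb = n / BS;

    for (size_t i = 0; i < nb; i++) {
        for (size_t j = 0; j < nb; j++) {
            int cur = 0;
            memset(c_tile, 0, sizeof c_tile);
            /* Prefetch the first pair of tiles. */
            dma_start_copy(a_buf[cur], A, n, i * BS, 0);
            dma_start_copy(b_buf[cur], B, n, 0, j * BS);
            for (size_t k = 0; k < nb; k++) {
                dma_wait(); /* tiles for step k are now in a_buf/b_buf[cur] */
                if (k + 1 < nb) {
                    /* Request the next pair into the other buffer set. */
                    dma_start_copy(a_buf[1 - cur], A, n, i * BS, (k + 1) * BS);
                    dma_start_copy(b_buf[1 - cur], B, n, (k + 1) * BS, j * BS);
                }
                tile_mac(c_tile, a_buf[cur], b_buf[cur]);
                cur = 1 - cur;
            }
            /* Write the finished tile back to the output matrix. */
            for (size_t r = 0; r < BS; r++)
                memcpy(C + (i * BS + r) * n + j * BS, c_tile + r * BS,
                       BS * sizeof(float));
        }
    }
}

/* Naive triple loop, used only to check the blocked version. */
void mmm_reference(float *C, const float *A, const float *B, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            float s = 0.0f;
            for (size_t k = 0; k < n; k++)
                s += A[i * n + k] * B[k * n + j];
            C[i * n + j] = s;
        }
}
```

On hardware with a real asynchronous DMA engine, the two stand-in functions would be replaced by the platform's own transfer and completion calls, and the buffers would be placed in the fast on-chip memory; those details are platform-specific and are not shown here.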
Related papers
Optimizing Matrix-Matrix Multiplication for an Embedded VLIW Processor
The optimization of matrix-matrix multiplication (MMM) performance has been well studied on conventional general-purpose processors like the Intel Pentium 4. Fast algorithms, such as those in the Goto and ATLAS BLAS libraries, exploit common microarchitectural features including superscalar execution and the cache and TLB hierarchy to achieve near-peak performance. However, the microarchitectur...
Space-time Tradeoffs in Memory Hierarchies
The speed of CPUs is accelerating rapidly, outstripping that of peripheral storage devices and making it increasingly difficult to keep CPUs busy. Multilevel memory hierarchies, scaled to simulate single-level memories, are increasing in importance. In this paper we introduce the Memory Hierarchy Game, a multi-level pebble game simulating data movement in memory hierarchies for straight-line comp...
High Performance Computing with the Cell Broadband Engine
The Cell Broadband Engine was conceived to enable the design of novel and highly efficient systems for compute-intensive applications. The Cell/B.E. departs from prior architectures by adopting a heterogeneous chip multiprocessor architecture with novel accelerator cores and an explicitly managed memory hierarchy. The increased computing density of the design improves peak performance as well a...
Optimized Dense Matrix Multiplication on a Many-Core Architecture
Traditional parallel programming methodologies for improving performance assume cache-based parallel systems. However, new architectures, like the IBM Cyclops-64 (C64), belong to a new set of manycore-on-a-chip systems with a software managed memory hierarchy. New programming and compiling methodologies are required to fully exploit the potential of this new class of architectures. In this pape...
Data Prefetching for Linear Algebra Operations on High Performance Workstations
In a previous work it was shown that the performance of linear algebra computations, which access large amounts of data, is dependent on the behavior of the memory hierarchy. This research aims to use the multilevel orthogonal blocking approach in conjunction with other software techniques to further improve the performance of linear algebra computations. The performance of the dense matrix...